Abstract

Background: Historically, extended haplotypes have been defined using only a few data points, such as alleles for 
several HLA genes in the MHC. High-density SNP data, and the increasing affordability of whole genome SNP 
typing, create the opportunity to define higher resolution extended haplotypes. This drives the need for new tools 
that support quantification and visualization of extended haplotypes as defined by as many as 2000 SNPs. 
Confronted with high-density SNP data across the major histocompatibility complex (MHC) for 2,300 complete 
families, compiled by the Type 1 Diabetes Genetics Consortium (T1DGC), we developed software for studying 
extended haplotypes. 

Methods: The software, called ExHap (Extended Haplotype), uses a similarity measurement we term congruence to 
identify and quantify long-range allele identity. Using ExHap, we analyzed congruence in both the T1DGC data and 
family-phased data from the International HapMap Project. 

Results: Congruent chromosomes from the T1DGC data have between 96.5% and 99.9% allele identity over 1,818 
SNPs spanning 2.64 megabases of the MHC (HLA-DRB1 to HLA-A). Thirty-three of 132 DQ-DR-B-A defined haplotype 
groups have > 50% congruent chromosomes in this region. For example, 92% of chromosomes within the DR3-B8-A1 
haplotype are congruent from HLA-DRB1 to HLA-A (99.8% allele identity). We also applied ExHap to all 22 
autosomes for both CEU and YRI cohorts from the International HapMap Project, identifying multiple candidate 
extended haplotypes. 

Conclusions: Long-range congruence is not unique to the MHC region. Patterns of allele identity on phased 
chromosomes provide a simple, straightforward approach to visually and quantitatively inspect complex long-range 
structural patterns in the genome. Such patterns aid the biologist in appreciating genetic similarities and 
differences across cohorts, and can lead to hypothesis generation for subsequent studies. 
Background 

The structure of genetic variation across the human 
genome is complex and is characterized by blocks of 
extended linkage disequilibrium separated by recombi- 
nation hotspots [1,2]. Extended haplotypes are one way 
to classify these regions of extended linkage disequili- 
brium. Historically, extended haplotypes have been 
defined using only a few data points, such as several 
HLA or complement genes in the major histocompat- 
ibility complex (MHC) [3-13], or 22 SNPs characterizing 
alleles of CD28 (7 SNPs), ICOS (8 SNPs), and CTLA4 (7 
SNPs) on chromosome 2 [14]. High-density SNP data, 
and the increasing affordability of whole genome SNP 
typing, create the opportunity to define higher resolution 
extended haplotypes. This drives the need for new 
tools that are able to support quantification and visuali- 
zation of extended haplotypes as defined by as many as 
2000 SNPs. 

Several methods have been described to analyze the 
structure in dense SNP data including haplotype blocks 
and extended haplotype homozygosity (EHH). Haplotype 
block analyses, such as those implemented in the pro- 
gram Haploview [15], typically identify short-range link- 
age disequilibrium from < 1 kilobases (kb) to 173 kb 
[16]. EHH (the probability that two random 
chromosomes carrying a specific core haplotype are 
identical by state for all loci within a specified region) 
and related metrics have been used to identify regions 
of the genome that show evidence of recent positive 
selection as these regions are characterized by long 
range haplotypes [17-19]. 
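
The EHH definition given above can be illustrated with a minimal sketch. It assumes that all input chromosomes already carry the same core haplotype and that each is supplied as an allele string over the region of interest; the function name and input layout are our own illustrative choices, not those of Sweep or any other published tool.

```python
from collections import Counter
from math import comb


def ehh(haplotypes):
    """Extended haplotype homozygosity for one core haplotype.

    `haplotypes` holds one allele string per chromosome, all covering the
    same region. The result is the probability that two chromosomes drawn
    at random (without replacement) are identical at every locus.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    counts = Counter(haplotypes)
    # Number of chromosome pairs that are identical by state across the region.
    identical_pairs = sum(comb(c, 2) for c in counts.values())
    return identical_pairs / comb(n, 2)


# Three identical chromosomes and one that differs at a single SNP:
# 3 identical pairs out of 6 possible pairs.
print(ehh(["ACGT", "ACGT", "ACGT", "ACTT"]))  # 0.5
```

Because a single mismatch anywhere in the region breaks identity for a pair, this measure falls quickly as the region grows, which is the motivation for the mismatch-tolerant congruence metric introduced next.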

However, as the number of nucleotides under consid- 
eration increases, a low level of allele mismatching must 
be accommodated to account for genotyping errors. 
Confronted with high-density SNP data across the MHC 
for 2,300 complete families, compiled by the Type 1 
Diabetes Genetics Consortium (T1DGC), we defined a 
metric that we termed congruence which provides a way 
to quantify allele identity (identical by state alleles) 
across long genomic regions. We developed software, 
called ExHap (Extended Haplotype), which starts with 
phased chromosomes, derives a consensus string, identi- 
fies chromosomes that closely match this consensus, cal- 
culates overall congruence percentages, and computes 
allele identity of these congruent chromosomes. 

In this field and in this text, the word chromosome 
is used in several different contexts. The first is in the 
general sense, e.g. "Human MHC Class I genes HLA-A, 
HLA-B, and HLA-C are found on chromosome 6." 
The second sense refers to a specific chromosome 
from an individual person as represented by some set 
of features, e.g. "In the HapMap data for chromosome 
6, each specific chromosome is represented by 91,357 
SNPs across the length of the chromosome" or "The 
T1DGC data set discussed in this work consists of 
9280 chromosomes, represented by 2837 SNPs across 
the range 29.3 Mb to 34.2 Mb." Throughout this work, 
whenever we discuss specific chromosomes as repre- 
sented by a set of features, we will clearly describe 
those features. 

Using these concepts, we have shown that long, highly 
conserved haplotypes with identity of HLA alleles and 
hundreds to thousands of SNPs are frequent across the 
4 megabases (Mb) of the MHC region [9-12]. In particu- 
lar, the DR3-B8-A1 extended haplotype is a long-range, 
common haplotype that is associated with type 1 dia- 
betes risk [10-13,20]. Chromosomes of this haplotype 
have a high level of allele identity across the MHC 
region with up to 9 Mb of near identity among chromo- 
somes from apparently unrelated individuals [13]. 

Congruence quantifies allele identity across a large 
number of contiguous SNPs within a group of chromo- 
somes. In this context, chromosome refers to a specific 
phased chromosome from an individual person, as 
represented by a set of SNPs. Here we apply ExHap to 
study congruence on 9,280 founder chromosomes geno- 
typed for 2,837 SNPs across the MHC on chromosome 
6p21. We also employ a sliding window to compute 
rolling short range congruence both within the MHC 



region and across the genome. Using these approaches, 
we compare overall congruence analyzing 1,818 SNPs in 
over 100 extended haplotype groups (defined by HLA-DQB1, 
HLA-DRB1, HLA-B and HLA-A alleles) from 
T1DGC data, illustrate statistically significant differences 
in congruence between haplotype groups, and inspect 
unrelated HapMap populations for congruence regions 
in a dataset of nearly 1.4 million SNPs across the 
human genome. Additionally, we compare the features 
and numeric results of ExHap to those of GERMLINE 
and Sweep, two other tools for identifying features of 
haplotype matrices. 

Methods 

Study populations and genotyping 

This analysis included 2,300 affected sibling pair families 
(10,012 individuals typed for HLA and/or SNPs) from 
the Type 1 Diabetes Genetics Consortium (T1DGC), 
using the 2007.11.MHC data freeze [21]. Affected sibling 
pairs and their parents were enrolled in 9 cohorts 
worldwide. Within the analyzed cohorts of Asia-Pacific, 
Europe, North America, UK, British Diabetes Associa- 
tion (BDA), Danish, Human Biological Data Interchange 
(HBDI), Joslin and Sardinian, 99% of individuals are 
classified as white/Caucasian or unknown. The T1DGC 
performed basic quality control analyses on the data. All 
study participants or their parents/surrogates provided 
written informed consent to participate, and the study 
protocol was approved by the relevant Ethics Commit- 
tees and Institutional Review Boards. 

Genotyping was completed for 3,072 SNPs at the 
Wellcome Trust Sanger Institute, using two custom Illu- 
mina mapping panels [2957 distinct SNPs (1536 SNPs in 
each panel with 115 overlapping SNPs) with 2837 of 
2957 SNPs successfully typed, yielding a 96% SNP suc- 
cess rate]. SNP positions ranged from 29.3 Mb to 34.2 
Mb from the telomere, covering a range of approxi- 
mately 4.9 Mb. SNPs were selected by the T1DGC using 
a tagging approach to account for the linkage disequilibrium 
in the region [21]. In addition, complete 4-digit HLA typing 
(HLA-DPB1, HLA-DPA1, HLA-DQB1, HLA-DQA1, HLA-DRB1, HLA-B, 
HLA-C, and HLA-A), performed using immobilized probe linear 
arrays, was available for all samples [22]. 

Data from the International HapMap Project (Hap- 
Map3, Release 2) was also used in this study. We stu- 
died 1,387,466 SNPs across the 22 autosomes. Phased 
founder chromosomes, as represented by SNPs across 
the length of the chromosomes, were used from both 
CEU (N = 176, CEPH, Utah residents with ancestry 
from northern and western Europe) and YRI (N = 200, 
Yoruba from Ibadan, Nigeria). For the Yoruba, donors 
were required to have four Yoruba grandparents. The 
criteria used to assign membership in the CEPH 
population have not been specified except that all 
donors were residents of Utah. 

Data processing 

SNP positions used NCBI Build 36. T1DGC chromo- 
somes (as represented by 2,837 SNPs in the approxi- 
mately 4.9 Mb region located 29.3 Mb to 34.2 Mb from 
the telomere) were generated from SNP and HLA geno- 
type data using multiple software packages. First, to 
establish that the genotype data demonstrated a Mende- 
lian inheritance pattern within each family, the Ped- 
Check program [23] was used on data from both 
Illumina panels and HLA separately. Mendelian inheri- 
tance patterns were present for all families. Next, data 
from each of the two Illumina mapping SNP panels 
and the HLA panel were combined using custom Java 
programs. Merlin software [24] was used to phase the 
SNP and HLA genotype data from families into chro- 
mosomes. In situations of ambiguous phase (resulting 
from heterozygous SNPs or HLA in all family members), 
phase was not inferred. Instead, unphased alleles were 
labeled as such. Founder chromosomes were used in 
these analyses, yielding 4 unique chromosomes per 
family, for a total of 9,280 founder chromosomes. 
AFBAC (affected family based control) methodology was 
used to assign case or control status to founder chromo- 
somes using Microsoft Excel macros as previously 
described [8,20,25,26]. 

Haplotype groups 

We analyzed founder MHC chromosomes from the 
T1DGC dataset, as represented by 1,818 SNPs ranging 
from HLA-DRB1 to HLA-A (2.64 Mb). We identified 
132 groups of chromosomes having at least 10 chromosomes 
per group and identical HLA-DQB1, HLA-DRB1, 
HLA-B, and HLA-A alleles, hereafter called DQ-DR-B-A 
"haplotype groups." HLA-DRB1 and HLA-DQB1 are in 
strong linkage disequilibrium; therefore, in this manuscript 
we refer to certain DRB1-DQB1 allele pairs by the 
DRB1 allele only. Hereafter, DRB1*01-DQB1*05 will be 
referred to as DRB1*01, DRB1*03-DQB1*02 as DRB1*03, 
DRB1*04-DQB1*0302 as DRB1*04, DRB1*07-DQB1*02 as 
DRB1*07, DRB1*08-DQB1*04 as DRB1*08, DRB1*11-DQB1*0301 
as DRB1*11, DRB1*12-DQB1*0301 as DRB1*12, 
DRB1*15-DQB1*0602 as DRB1*15, and DRB1*16-DQB1*05 
as DRB1*16 unless otherwise specified. 
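
As an illustration of the grouping step, the sketch below stratifies chromosomes by their four HLA alleles and keeps groups of at least 10. The record layout (dictionaries with the keys `id`, `DQB1`, `DRB1`, `B` and `A`) is a hypothetical one chosen for the example; it is not the T1DGC file format.

```python
from collections import defaultdict


def haplotype_groups(chromosomes, min_size=10):
    """Group chromosomes with identical DQB1, DRB1, B and A alleles.

    `chromosomes` is an iterable of dicts with the hypothetical keys
    'id', 'DQB1', 'DRB1', 'B' and 'A'. Groups with fewer than `min_size`
    chromosomes are dropped.
    """
    groups = defaultdict(list)
    for chrom in chromosomes:
        key = (chrom["DQB1"], chrom["DRB1"], chrom["B"], chrom["A"])
        groups[key].append(chrom["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= min_size}
```

Applied to the founder chromosomes, a filter of this kind would yield the haplotype groups that are then analyzed one at a time.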

Algorithm and software 

Given a set of chromosomes represented by a range of 
contiguous SNPs, ExHap derives a single consensus 
string that captures the commonalities among chromosomes. 
It also identifies congruent chromosomes (those 
chromosomes that match the consensus string for at least 
20 out of every 30 contiguous SNPs across multiple 
overlapping blocks) and computes the percentage of allele 
identity between each chromosome and the consensus 
string. ExHap has two main parts: deriving a consensus 
string and filtering out chromosomes that do not match 
this string. The program (1) derives a consensus string 
by iteratively (a) positioning a derivation window of 
length W; (b) computing the most frequent string of 
length W within the derivation window; (c) appending 
the first N alleles of this most frequent substring to the 
consensus string; and (2) filters out chromosomes that 
do not match the emerging consensus sequence by (a) 
positioning the filter window; (b) within the filter win- 
dow, checking SNPs from each chromosome for identity 
with the consensus sequence; and (c) removing noncon- 
gruent chromosomes from further consideration based 
on a matching rule for allele identity. A flow chart and 
a simplified example are shown in Figures 1 and 2, respectively. 

This algorithm detects allele identity across a large 
number of contiguous SNPs. The derivation window 
size, as specified by W, provides a "look ahead," favoring 
the selection of sequences that are identical beyond the 
recorded substring of length N. Commonly, we set W = 
30 and N = 10. Our standard, but configurable, match- 
ing rule is that 20 of every 30 contiguous SNPs must 
match the consensus string. Unknown/unphased SNPs 
are not counted as mismatched. However, chromosomes 
with a large number of unknown/unphased SNPs can be 
excluded from analysis. A detailed description of the 
program and the user-configurable parameters can be 
found in Additional file 1. ExHap is written in Java and 
is freely available as Additional file 2. 
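
The two-part procedure described above can be sketched as follows. This is a simplified reading of the description, not the ExHap source code (which is in Java): the function and parameter names, the `?` symbol for unknown/unphased alleles, the one-SNP filter offset, and the choice to count unknown alleles as agreeing when computing allele identity are all assumptions made for illustration.

```python
from collections import Counter

UNKNOWN = "?"  # our placeholder for an unknown/unphased allele


def _agree(a, b):
    """Alleles agree if identical or if either one is unknown."""
    return a == b or UNKNOWN in (a, b)


def exhap_sketch(chromosomes, W=30, N=10, filter_size=30, min_match=20):
    """Return (consensus, congruent indices, % allele identity per congruent chromosome)."""
    n_snps = len(chromosomes[0])
    active = list(range(len(chromosomes)))  # chromosomes still under consideration
    consensus = ""
    next_filter_start = 0
    while len(consensus) < n_snps and active:
        pos = len(consensus)
        # 1. Derive: most frequent string in the derivation window,
        #    counted over the chromosomes still under consideration.
        windows = [chromosomes[i][pos:pos + W] for i in active]
        known = [w for w in windows if UNKNOWN not in w] or windows
        best = Counter(known).most_common(1)[0][0]
        consensus += best[:N]
        # 2. Filter: slide the filter window one SNP at a time over the
        #    part of the consensus derived so far.
        while next_filter_start + filter_size <= len(consensus):
            lo, hi = next_filter_start, next_filter_start + filter_size
            active = [
                i for i in active
                if sum(_agree(chromosomes[i][j], consensus[j])
                       for j in range(lo, hi)) >= min_match
            ]
            next_filter_start += 1
    identity = {
        i: 100.0 * sum(_agree(chromosomes[i][j], consensus[j])
                       for j in range(len(consensus))) / len(consensus)
        for i in active
    }
    return consensus, active, identity


# Toy matrix of our own, run with small windows (W = 9, N = 3, 2 of 3 must match).
toy = ["AAACCCTTTGGG", "AAACCCTTTGGG", "AAACCGTTTGGG",
       "AAAGGCTTTGGG", "TTACCCTTTGGG", "AAACCCTTTGCC"]
consensus, congruent, identity = exhap_sketch(toy, W=9, N=3, filter_size=3, min_match=2)
print(consensus, congruent)  # AAACCCTTTGGG [0, 1, 2]
```

In this toy run three of six chromosomes remain congruent; the third differs from the consensus at one of 12 SNPs (about 92% allele identity) and is retained because no 3-SNP window contains more than one mismatch.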

GERMLINE and Sweep 

Using GERMLINE version 1.4.2 [27], we generated pair- 
wise shared segments over 3 different regions on 3 dif- 
ferent chromosomes for both CEU and YRI cohorts 
from HapMap data as detailed above. We set minimum 
segment length (min_m) to 0.75 Mb. Additionally, we 
used the option to match segments on haplotypes rather 
than genotypes (h_extend). Using Sweep version 1.1 
[17], we characterized the above-mentioned chromoso- 
mal regions with the ExportEHHvsFreqData batch com- 
mand. We configured Sweep to compute EHH for each 
core haplotype at 600 Kb (MatchAt = Distance 600 Kb). 
Cores were configured using the default criteria of long- 
est non-overlapping haplotype blocks as defined by Gab- 
riel [16], with block length running from 3 to 20 SNPs. 

Figure 1 Flow chart of ExHap algorithm. 

1. Derive consensus string 
   a. Position derivation window 
   b. Compute most frequent string 
   c. Add first part of most frequent string to emerging consensus 
2. Filter out non-matching chromosomes 
   a. Position filter window 
   b. Check SNPs from each chromosome for identity with consensus 
   c. Remove non-congruent chromosomes based on matching rule 
Decision boxes: "End of existing consensus string?" and "End of haplotype 
matrix?" (each with a "No" branch), followed by "Report results." 

This figure provides a high-level conceptual overview of the congruence 
finding algorithm. First, a consensus string is identified for a short section of 
the chromosomes under consideration. Second, chromosomes 
which do not match this consensus are eliminated from further 
consideration. The process repeats until the end of the overall 
region is reached. 

Recombination rates 

To compare congruence to recombination rates, we 
used HapMap recombination rates, estimated from 
phased haplotypes in HapMap Release 22 (NCBI Build 
36). These rates were computed from the pooled 
populations of the CEU, YRI, and JPT + CHB. For each 
250 SNP region for which we computed congruence, we 
also identified the maximum recombination rate. For 
chromosome 10, we plotted the percent congruence versus 
this maximum recombination rate, identifying multiple 
regions in which both congruence rates and 
recombination rates were relatively high. 

Identification of known genes and associated KEGG 
pathways in regions of candidate extended haplotypes. 

For each candidate extended haplotype listed in Addi- 
tional file 3: Table S2 (e.g. Cohort YRI, chromosome 1, 
starting position 45708938, ending position 46703019), 
we queried the UCSC Genome Browser database [28] 
(http://genome.ucsc.edu) through the MySQL interface, 
issuing queries using the following template: "select 
distinct 'YRI', chrom, txStart, geneSymbol, description from 
hg18.knownGene k, hg18.kgXref x where k.name = x.kgID 
and txStart > 45708938 and txEnd < 46703019 
and chrom = 'chr1' order by txStart;" For each unique 
resulting gene, we then issued a query against the 
KEGG pathway tables using the following query template: 
"select distinct km.description from hg18.knownGene k, 
hg18.kgXref x, hg18.keggMapDesc km, hg18.keggPathway kp 
where k.name = x.kgID and x.kgID = kp.kgID 
and kp.mapID = km.mapID and geneSymbol = 'AARSD1';" 
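
The two query templates can be filled in programmatically for each candidate region and gene. The sketch below only builds the SQL text and its parameters. The helper names are hypothetical, and the `%s` placeholders assume a Python database driver that uses that parameter style; no particular driver or connection is implied by the original analysis.

```python
# Query templates taken from the text, with the literal values replaced by
# placeholders so that they can be supplied as parameters.
GENE_SQL = (
    "select distinct %s, chrom, txStart, geneSymbol, description "
    "from hg18.knownGene k, hg18.kgXref x "
    "where k.name = x.kgID and txStart > %s and txEnd < %s "
    "and chrom = %s order by txStart"
)

PATHWAY_SQL = (
    "select distinct km.description "
    "from hg18.knownGene k, hg18.kgXref x, "
    "hg18.keggMapDesc km, hg18.keggPathway kp "
    "where k.name = x.kgID and x.kgID = kp.kgID "
    "and kp.mapID = km.mapID and geneSymbol = %s"
)


def gene_query(cohort, chrom_number, start, end):
    """SQL and parameters listing known genes inside one candidate region."""
    return GENE_SQL, (cohort, f"chr{chrom_number}", start, end)


def pathway_query(gene_symbol):
    """SQL and parameters listing KEGG pathway descriptions for one gene."""
    return PATHWAY_SQL, (gene_symbol,)


sql, params = gene_query("YRI", 1, 45708938, 46703019)
print(params)  # ('YRI', 'chr1', 45708938, 46703019)
```

Each returned pair would typically be passed to the `execute` method of a database cursor, once per candidate extended haplotype and once per resulting gene.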

Statistical analysis 

The Wilcoxon signed-rank test was used to compare 
congruence percentages between groups across chromosomal 
ranges, with a significance level of α = 0.05. All 
statistical computations were performed using the R 
Project for Statistical Computing [29]. Heat map style 
allele identity plots were generated with the heatmap.2 
function (gplots package) in R. 

Linkage disequilibrium plots, haplotype blocks, and 
associated metrics were generated with Haploview ver- 
sion 4.2 [15]. Blocks were determined using the default 
Gabriel et al. criteria, ignoring pairwise comparisons of 
markers greater than 500 kb apart [16]. 

Results 

Differing levels of congruence across haplotype groups in 
the T1DGC 

We inspected phased HLA and SNP data on chromo- 
some 6, spanning the range HLA-DRB1 to HLA-A (32.6 
Mb to 30.0 Mb, 1818 SNPs across 2.64 Mb). Hereafter 
this data will be referred to as a chromosome. We strati- 
fied the T1DGC MHC founder chromosomes into hap- 
lotype groups defined by identity of DQ-DR-B-A as 
described in the methods. We restricted analysis to hap- 
lotype groups containing at least 10 chromosomes, 
resulting in 132 HLA "identical" groups identified for 
further study. Using ExHap, we identified congruent 
chromosomes for each of these haplotype groups. 
Details of the algorithm are provided in the methods 
and illustrated in Figures 1 and 2. Briefly, ExHap identi- 
fies the most common string of SNPs over a small win- 
dow or subrange of, for example, 30 SNPs. Then, 
ExHap slides the window a smaller number of SNPs (e. 
g. 10) and repeats the process, thereby deriving a con- 
sensus string for the entire range. Additionally, as the 
consensus string emerges, ExHap scores each chromo- 
some for identity with the emerging string. Chromo- 
somes that do not sufficiently match the consensus are 
eliminated from further consideration. 

Figure 2 An illustration of two iterations of ExHap. 

First Iteration 
1. Derive consensus string 
   a. Position derivation window of width W (W = 9 in this example) over the haplotype matrix 
   b. Compute most frequent string (AAACCCTTT) 
   c. Add first N SNPs (N = 3 in this example) to consensus (emerging consensus: AAA) 
2. Filter out chromosomes that do not match the consensus string based on rule. 
   In this example, at least 2 of 3 SNPs must match. 
   a. Position filter window (SNPs 1-3 of the emerging consensus) 
Second Iteration 
1. Derive consensus string 
   a. Position inspection window of width W, offset from previous location 
      by N SNPs (N = 3 in this example) 
   b. Compute most frequent string 
   c. Add first N SNPs to consensus (emerging consensus: AAACCC) 
2. Filter out chromosomes that do not match the consensus string based on rule. 
   a. Position filter window, sliding by 1 SNP on each pass (SNPs 2-4 of the emerging consensus) 
   b. Check SNPs from each chromosome for identity with consensus 
   c. Remove non-congruent chromosomes 
Final result 
50% (3/6) Congruence, 97% Average Allele Identity. Consensus: AAACCCTTTGGG. 
% Allele Identity of the congruent chromosomes: 100%, 92%, 100%. 

This figure illustrates the first two iterations and the final result of ExHap on a simplified 
example. Each iteration consists of 2 parts: derivation of the consensus string and filtering of chromosomes for congruence. In Iteration 1, Part 1, we 
compute the most frequent sequence or string of 9 SNPs from our haplotype matrix. Then we add the first 3 SNPs from this string to the emerging 
consensus string. In Part 2, we filter the chromosomes based on whether or not each matches the consensus string using a matching rule. In this 
example, the matching rule is that at least 2 of 3 contiguous SNPs must match. SNPs that do not match are highlighted in gray. In Iteration 2, Part 1, 
the position of the derivation window is adjusted by 3 SNPs. In Part 2, we compare SNPs 2-6 of each chromosome to SNPs 2-6 of the consensus by 
inspecting 3 SNPs at a time. Thus, we must inspect SNPs 2-4, 3-5, and 4-6. The processes described here are repeated over the entire haplotype matrix. 
The final result shows the congruent chromosomes and the percentage of allele identity for each chromosome. The size of the derivation window 
and the number of SNPs to record are user-configurable. In this example, the size of the derivation window is 9 and the number of SNPs to record is 
3. The size of the filter window, the number of SNPs that must match in this window, and the window offset are also user-configurable. In this 
example, the size of the filter window is 3, the number of matching SNPs is 2, and the offset is 1. 

Figure 3 Congruence is widespread across the MHC and is not limited to the DR3-B8-A1 haplotype. (Panel A title: Congruence from HLA-DRB1 to HLA-A.) A: Percent congruence of 132 
extended DQ-DR-B-A haplotype groups from HLA-DRB1 to HLA-A. Plotted is the percent of chromosomes (both cases and controls) within each 
HLA haplotype group that meet the rules for congruence from HLA-DRB1 to HLA-A. The number of chromosomes in each group ranged from 8 
to 747 (mean = 34, median = 17), and specific details for each haplotype group are given in Additional file 4: Table S1. Arrows identify 
haplotype groups that will be discussed further in the paper (number = DR.B.A: 7 = 3.8.1, 12 = 7.44.23, 129 = 3.7.2). B: Examples of congruence 
within two haplotype groups. In these two images, each column represents one founder chromosome (represented by 1,818 SNPs across 2.64 
Mb ranging from HLA-DRB1 to HLA-A on chromosome 6) with the specified HLA alleles (e.g., DR7, B44 and A23). Each row illustrates data for one 
SNP. Yellow represents alleles that match the consensus sequence, whereas blue represents alleles that do not match the consensus. White 
denotes missing or unphased alleles. Congruent chromosomes are indicated by a tick mark at the bottom of the image. The DR7-B44-A23 
haplotype group exhibits extreme congruence from HLA-DRB1 to HLA-A, seen by the high frequency of chromosomes that are congruent for the 
entire region (yellow from HLA-DRB1 to HLA-A). In contrast, DR3-B7-A2 shows an example of a haplotype group in which only one of the 
chromosomes is congruent across the region. Because the consensus chromosome is derived from a set of chromosomes, it is not necessarily 
an exact match to any particular chromosome. Thus the one congruent chromosome in this haplotype group does not have 100% allele identity. 

Figure 3A illustrates congruence for each DQ-DR-B-A 
haplotype group. Our results identify 33 haplotype 
groups with over 50% congruence. This means that at 
least half of their chromosomes are nearly identical in 
the region between HLA-DRB1 and HLA-A. In particular, 
the common 3.8.1 (DR3-B8-A1) haplotype is highly 
congruent from HLA-DRB1 to HLA-A [91.7% (685/747 
chromosomes)]. Across 132 MHC haplotype groups, 
average allele identity for congruent chromosomes 
within a haplotype group varies from 96.5% to 99.9%, 
with a median of 99.5%, over 2.64 megabases. Detailed 
congruence results for all haplotype groups are included 
in Additional file 4: Table S1. Two examples of chromosome-level 
patterns of allele identity are shown in Figure 
3B, the highly congruent DR7-B44-A23 (23/28 chromo- 
somes, 82% congruent from HLA-DRB1 to HLA-A) and 
the less congruent DR3-B7-A2 (1/26 chromosomes, 4% 
congruent). In these two images (what we will call allele 
identity plots), each column represents one founder 
chromosome (as represented by 1,818 SNPs across the 
2.64 Mb range) with the specified HLA alleles (e.g., 
DR7, B44 and A23) and each row represents one SNP. 
Yellow represents alleles that match the consensus 
sequence whereas blue represents alleles that do not 
match the consensus. Eighty-two percent of the DR7- 
B44-A23 chromosomes are congruent, while only one of 
the DR3-B7-A2 chromosomes is congruent. By defini- 
tion, at least 1 chromosome must be congruent, so this 
represents the lower limit of congruence. Chromosomes 
congruent for the entire region are shown on the left as 
indicated by the tick mark below the chromosome. 
ExHap generates output preformatted for these "heat 
map" style graphics. 
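
The data behind an allele identity plot can be pictured as a SNP-by-chromosome matrix. The sketch below shows one possible encoding, with 1 for a match to the consensus (yellow), 0 for a mismatch (blue) and `None` for a missing or unphased allele (white). The encoding and the `?` symbol are assumptions made for illustration; the actual ExHap output format is described in Additional file 1.

```python
def identity_matrix(chromosomes, consensus, unknown="?"):
    """Rows are SNPs, columns are chromosomes.

    Each cell is 1 (allele matches the consensus), 0 (mismatch) or
    None (allele is unknown/unphased).
    """
    return [
        [None if c[j] == unknown else int(c[j] == consensus[j])
         for c in chromosomes]
        for j in range(len(consensus))
    ]


# Three chromosomes, two SNPs, consensus "AC".
print(identity_matrix(["AC", "A?", "TC"], "AC"))
# [[1, 1, 0], [1, None, 1]]
```

A matrix of this shape, with missing values mapped to a third color, can be passed to any heat map routine that accepts a numeric matrix.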

Rolling congruence in the T1DGC 

In addition to examining congruence across a specific 
range of SNPs (HLA-DRB1 to HLA-A), we inspected 
"rolling congruence." Figure 4A shows rolling congruence 
of two different haplotype groups, DRB1*03-B*08 
and DRB1*0401-B*08 (DQB1*0302). We computed congruence 
for contiguous sliding windows of 250 SNPs, 
with each window offset from the previous window by 
50 SNPs (e.g., SNPs 1-250, 51-300, 101-350, etc.), resulting 
in 55 observations across the MHC. The percentage 
of congruent regions (as defined by the 250 SNPs under 
consideration) is plotted for each window. As can be 
seen in Figure 4A, the regions on DRB1*03-B*08 chromosomes 
are significantly more congruent from HLA-DRB1 
to HLA-B than the regions on DRB1*0401-B*08 
(DQB1*0302) chromosomes even though each group is 
fixed for both HLA-DRB1 and HLA-B alleles (p = 
0.0002, Wilcoxon signed-rank test, mean congruence = 
81.5% for DRB1*03-B*08 compared to mean congruence 
= 63.8% for DRB1*0401-B*08). These differences can be 
visualized by examining the allele identity plots for these 
two groups in Figure 4B (same color coding as Figure 
3B). For the most part, the DRB1*03-B*08 chromosomes 
are extremely congruent from HLA-DRB1 to HLA-B 
(96% of chromosomes are congruent), with 65% (719/1103) 
of the chromosomes congruent all the way from 
HLA-DRB1 to HLA-A. The DRB1*0401-B*08 
(DQB1*0302) chromosomes, on the other hand, are congruent 
in the central region, near HLA-B, but are not 
congruent near HLA-DRB1 or HLA-A. We can see from 
these examples that fixing HLA-DRB1 and HLA-B alleles 
is not sufficient to guarantee congruence, and that the 
DRB1*03-B*08 chromosomes are strikingly more congruent 
than the DRB1*0401-B*08 (DQB1*0302) chromosomes 
across the analyzed regions. 
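
The sliding-window scheme can be sketched as below. Window bounds are 0-based and half-open, so `(0, 250)` corresponds to SNPs 1-250 and `(50, 300)` to SNPs 51-300. The scoring function `modal_fraction` is only a stand-in (the percentage of chromosomes identical to the most common string in the window); in the actual analysis the congruence percentage for each window comes from the consensus and filter procedure described in the Methods. All names here are our own.

```python
from collections import Counter


def rolling_windows(n_snps, size=250, step=50):
    """Yield 0-based, half-open (start, end) bounds of each full window."""
    for start in range(0, n_snps - size + 1, step):
        yield start, start + size


def modal_fraction(segments):
    """Stand-in score: percentage of segments equal to the most common one."""
    top_count = Counter(segments).most_common(1)[0][1]
    return 100.0 * top_count / len(segments)


def rolling_congruence(chromosomes, score=modal_fraction, size=250, step=50):
    """Return one (window start, score) pair per window."""
    n_snps = len(chromosomes[0])
    return [
        (start, score([c[start:end] for c in chromosomes]))
        for start, end in rolling_windows(n_snps, size, step)
    ]


print(list(rolling_windows(350)))  # [(0, 250), (50, 300), (100, 350)]
```

Only full-length windows are generated in this sketch; how ExHap treats a final partial window is not stated in this section.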

Rolling congruence in HapMap 

Motivated by results from the MHC region, we applied 
ExHap to family-phased data from two HapMap popula- 
tions [CEU (CEPH, Utah residents with ancestry from 
northern and western Europe) and YRI (Yoruba from 
Ibadan, Nigeria)] across the 22 autosomes (Additional 
file 5: Figure S1). We computed congruence for contiguous 
sliding windows of 250 SNPs, with each window offset 
from the previous window by 50 SNPs, resulting in 
27,652 observations across the genome. In this case, 
chromosomes are represented by 20,085 to 116,415 
SNPs (chromosome 22 and 1, respectively). Congruence 
metrics are based on 250 contiguous SNPs. As expected, 
congruence in the CEU cohort is higher than in the YRI 
cohort, with the CEU congruence exceeding YRI con- 
gruence for 72% of the observations (19,914/27,652). 
Furthermore, for the CEU, the 99th percentile of these 
observations was 51.1%, as compared to 27.5% for YRI. 
To identify candidate extended haplotypes, we looked 
for two or more contiguous observations that exceeded 
the 99th percentile for the respective population. We 
identified 51 such candidate extended haplotypes in 
CEU and 55 in YRI. These candidate extended haplo- 
types are at least 300 SNPs in length, and are detailed in 
Additional file 3: Table S2. 
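
The selection rule above (two or more contiguous observations exceeding a threshold) can be sketched as a simple run search. The threshold is taken as given, for example the 99th percentile for the population; the function names are hypothetical, and the sketch assumes it is applied to the observations of one chromosome at a time so that runs do not cross chromosome boundaries.

```python
def candidate_runs(values, threshold, min_run=2):
    """Return (first, last) window indices of each run of at least
    `min_run` consecutive values that exceed `threshold`."""
    runs, start = [], None
    for i, value in enumerate(values):
        if value > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last observation.
    if start is not None and len(values) - start >= min_run:
        runs.append((start, len(values) - 1))
    return runs


def run_span_in_snps(first, last, size=250, step=50):
    """Number of SNPs covered by windows first..last (inclusive)."""
    return size + (last - first) * step


congruence = [10, 60, 70, 5, 80, 3, 55, 56, 57]
print(candidate_runs(congruence, threshold=50))  # [(1, 2), (6, 8)]
print(run_span_in_snps(1, 2))                    # 300
```

Two adjacent 250-SNP windows offset by 50 SNPs cover 300 SNPs, which is why every candidate is at least 300 SNPs long.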

To provide a bit more detail and context about these 
candidate extended haplotypes, we identified genes con- 
tained in the regions and associated pathways. For each 
cohort, for each candidate extended haplotype, we quer- 
ied the UCSC Genome Browser database to first identify 
known genes in these regions. Then, for each cohort, 
for each gene, we queried the database to identify 
KEGG pathways with which the gene is associated. We 
identified 900 distinct known genes in the candidate 
extended haplotype regions of the CEU cohort, and 
1043 in the YRI cohort. In the CEU data, the 900 genes 
resulted in a total of 428 hits in the KEGG pathways 
across 134 distinct pathways. In the YRI data, the 1043 
genes resulted in 750 hits in the KEGG pathways across 
165 distinct pathways. All resulting genes are listed in 
Additional file 6: Table S3. All resulting pathways, 
together with the number of associated genes, are listed 
in Additional file 7: Table S4. The pathways with which 
10 or more genes from the YRI cohort are associated 
are summarized in Table 1. 

Figure 4 Comparison of two haplotype groups from the T1DGC. (Panel A title: Rolling Congruence by Base Position; series: DRB1*03-B*08 and DRB1*0401-B*08 (DQB1*0302). Panel B label: DRB1*03-B*08, 1103 chromosomes, 1818 SNPs, 2.64 Mb.) A: Congruence was calculated separately for two haplotype groups, 
DRB1*03-B*08 and DRB1*0401-B*08 (DQB1*0302). There is a significant difference in the congruence between the two groups centromeric of 
HLA-B, while the congruence within the two groups is not significantly different telomeric of HLA-B. B: In these two images, each column 
illustrates one founder chromosome (represented by 1,818 SNPs across 2.64 Mb ranging from HLA-DRB1 to HLA-A on chromosome 6) with the 
specified HLA alleles (e.g., DRB1*03 and B*08). Each row illustrates data for one SNP. Yellow represents alleles that match the consensus 
sequence, whereas blue represents alleles that do not match the consensus. White denotes missing or unphased alleles. Congruent 
chromosomes are indicated by a tick mark at the bottom of the image. The DRB1*03-B*08 chromosomes are highly congruent across the entire 
region from HLA-DRB1 to HLA-B, and in many cases from HLA-DRB1 to HLA-A. On the other hand, the DRB1*04-B*08 (DQB1*0302) chromosomes 
are congruent around HLA-B but are not congruent near HLA-DRB1 or HLA-A. 

Figures 5, 6, and 7 show rolling congruence, allele iden- 
tity, and linkage disequilibrium plots for three represen- 
tative candidate extended haplotypes. The region 
highlighted on chromosome 2 shows a high congruence 
peak for CEU only (Figure 5), the region on chromo- 
some 8 for both CEU and YRI (Figure 6), and the region 
on chromosome 10 for YRI only (Figure 7). As can be 
seen from the allele identity plots for each of these 
regions, high levels of congruence correspond to high 
levels of allele identity as indicated by the yellow regions 
in the allele identity plots. The peak on chromosome 2 
(Figure 5), where the CEU chromosomes are more con- 
gruent than the YRI chromosomes, includes the lactase 
gene (LCT). Strong recent positive selection has been 
reported for a large region that includes LCT [17-19]. 
The peak on chromosome 8 is characterized by both 
high CEU and high YRI congruence, with the CEU and 
YRI consensus sequences matching for 367 of 370 SNPs 
(99%, Figure 6). The peak on chromosome 10 demon- 
strates high YRI congruence but lower CEU congruence 
(Figure 7). Interestingly, for this chromosome 10 region, 
the YRI population shows 44.5% congruence with a 
median multi-allelic D' of 0.75 (range 0.45-1.0) across 
12 blocks. In contrast, the CEU population shows lower 
congruence at 9.1% but a higher median multi-allelic D' 
of 0.87 (range 0.33-0.98) across 13 blocks. Thus, congru- 
ence and haplotype block-associated metrics capture dif- 
ferent features of the region. 
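
To make the block-based side of this contrast concrete, the following minimal Python sketch computes the standard pairwise D' for two biallelic SNPs from phased haplotypes. It is not the multi-allelic D' between haplotype blocks reported above, and the function name and input layout are illustrative assumptions rather than part of ExHap.

```python
def pairwise_d_prime(haplotypes, allele_a, allele_b):
    """Return |D'| for two biallelic SNPs, or None if it is undefined.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples, one per
    phased chromosome (hypothetical input layout, not an ExHap format).
    """
    n = len(haplotypes)
    if n == 0:
        return None
    # Allele and haplotype frequencies
    p_a = sum(1 for h in haplotypes if h[0] == allele_a) / n
    p_b = sum(1 for h in haplotypes if h[1] == allele_b) / n
    p_ab = sum(1 for h in haplotypes if h == (allele_a, allele_b)) / n
    # Raw disequilibrium coefficient
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return None  # at least one SNP is monomorphic
    return abs(d) / d_max
```

A pairwise value of this kind summarizes association between two sites, whereas percent congruence summarizes how many whole chromosomes track a consensus, which is one way to see why the two measures can disagree for the same region.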

Table 1 Extract of Pathway Table. 

The following pathways were associated with 10 or more genes found in 
candidate extended haplotypes of the YRI cohort. In all cases shown here, 
there are more associated genes in the YRI cohort than in the CEU cohort. 
This is an extract of Additional file 7: Table S4. 

Pathways                                  CEU Hits   YRI Hits 
Olfactory transduction                        11         56 
Systemic lupus erythematosus                  15         49 
Metabolic pathways                            36         47 
MAPK signaling pathway                        12         20 
Regulation of actin cytoskeleton               9         14 
Cytokine-cytokine receptor interaction         2         13 
Endocytosis                                    5         11 
Cell cycle                                     9         10 
Glycolysis/Gluconeogenesis                     3         10 
Antigen processing and presentation            2         10 
Chemokine signaling pathway                    1         10 

Comparison of ExHap to GERMLINE and Sweep 

We compared the features and results of ExHap to 
those generated by GERMLINE [27] and Sweep [17]. All 
three approaches identify patterns of similarity in sets of 
phased chromosomes as defined by a series of SNPs. 
GERMLINE was designed to identify hidden relatedness 
as defined by long nearly identical regions shared by 
two chromosomes. Sweep was designed to identify 
regions of recent positive selection. ExHap was designed 
both to identify a consensus sequence and to score individual 
chromosomes for a configurable level of matching 
with that consensus. Because the three approaches were 
intended to highlight different characteristics of haplo- 
type matrices, comparisons among them should be 
interpreted with caution. That said, Table 2 summarizes 
selected features of the approaches while Table 3 pre- 
sents numeric results. 

ExHap handles missing data (unknown or unphased), 
while Sweep and GERMLINE require complete phasing, 
which can be inferred using programs such as BEAGLE 
and PHASE [30-33]. Both ExHap and GERMLINE allow a 
small number of mismatches. In part because of this tolerance 
for noise, both ExHap and GERMLINE produce 
meaningful numeric results over longer distances than 
does Sweep. Because Sweep requires exact matches by 
design, Extended Haplotype Homozygosity (EHH) deteriorates 
with diversity, which, all other factors being equal, is 
likely to increase with distance. Both GERMLINE and 
Sweep are able to compute and report results in either 
genetic distance or physical distance, while the current 
implementation of ExHap reports physical distance only. 
ExHap reports both aggregate region-wide metrics and 
chromosome level detail. In contrast, GERMLINE reports 
segments that are shared by pairs of chromosomes, while 
Sweep reports metrics at a core level. 

Table 3 summarizes results from all 3 approaches for 
the regions on chromosomes 2, 8 and 10 illustrated in 
Figures 5, 6 and 7. To compare GERMLINE to ExHap, 
we used GERMLINE to compute the number of shared 
segments with a length of at least 0.75 Mb in the speci- 
fied regions. Then, to account for the different number 
of chromosomes in the CEU and YRI data sets, we 
divided the total number of shared segments by the 
number of chromosomes to derive the average number 
of shared segments per chromosome. For the region on 
chromosome 2, both Percent Congruent and Average 
Number of Shared Segments are notably larger for CEU 
than YRI. For the region on chromosome 8, the Average 
Number of Shared Segments for CEU and YRI are rela- 
tively close, as are the Percent Congruent measure- 
ments. For the region on chromosome 10, the difference 
in Percent Congruent between CEU and YRI is much 
more dramatic than the difference in Average Number 
of Shared Segments. However, upon inspection of Figure 
7, this is a relatively noisy region, which is reflected by 
the generally low average number of shared segments 
reported by GERMLINE. 

[Figure 5 panels: Chromosome 2 Rolling Congruence (CEU and YRI) versus Distance from Telomere (Mb); CEU LD Plot (left) and Allele Identity (right), 391 SNPs, 1.05 Mb; YRI LD Plot (left) and Allele Identity (right), 391 SNPs, 1.05 Mb.] 

Figure 5 Comparing unrelated HapMap populations: chromosome 2. This figure shows rolling congruence over a section of chromosome 
2. Detailed linkage disequilibrium (LD) plots with haplotype blocks and allele identity plots correspond to the shaded peak. The two horizontal 
reference lines on the rolling congruence plot correspond to the 99th percentile of congruence observations across the genome. That value is 
51.1% for CEU (solid line) and 27.5% for YRI (dotted line). On the LD plots, shades of red indicate increasing D'. For the allele identity plots, each 
column is one chromosome (as represented by 391 SNPs and 1.05 Mb) and each row is a SNP. Yellow represents alleles that match the 
consensus sequence whereas blue represents alleles that do not match the consensus. Congruent chromosomes are indicated by the tick mark 
at the bottom of the allele identity plot. This peak on chromosome 2 represents a region where CEU chromosomes are more congruent than 
YRI chromosomes and is near the LCT gene, previously implicated as a target of recent positive selection. 

[Figure 6 panels: Chromosome 8 Rolling Congruence (CEU and YRI) versus Distance from Telomere (Mb); YRI LD Plot (left) and Allele Identity (right), 370 SNPs, 0.96 Mb, % Congruent = 70, % Allele Identity = 98.] 

Figure 6 Comparing unrelated HapMap populations: chromosome 8. This figure shows rolling congruence over a section of chromosome 
8. Detailed linkage disequilibrium (LD) plots with haplotype blocks and allele identity plots correspond to the shaded peak. The two horizontal 
reference lines on the rolling congruence plot correspond to the 99th percentile of congruence observations across the genome. That value is 
51.1% for CEU (solid line) and 27.5% for YRI (dotted line). On the LD plots, shades of red indicate increasing D'. For the allele identity plots, each 
column is one chromosome (as represented by 370 SNPs and 0.96 Mb) and each row is a SNP. Yellow represents alleles that match the 
consensus sequence whereas blue represents alleles that do not match the consensus. Congruent chromosomes are indicated by the tick mark 
at the bottom of the allele identity plot. This peak on chromosome 8 is characterized by both high CEU and high YRI congruence. 

[Figure 7 panels: Chromosome 10 Rolling Congruence; CEU LD Plot (left) and Allele Identity (right), 418 SNPs, 1.11 Mb; YRI LD Plot (left) and Allele Identity (right), 418 SNPs, 1.11 Mb.] 

Figure 7 Comparing unrelated HapMap populations: chromosome 10. This figure shows rolling congruence over a section of chromosome 
10. Detailed linkage disequilibrium (LD) plots with haplotype blocks and allele identity plots correspond to the shaded peak. The two horizontal 
reference lines on the rolling congruence plot correspond to the 99th percentile of congruence observations across the genome. That value is 
51.1% for CEU (solid line) and 27.5% for YRI (dotted line). On the LD plots, shades of red indicate increasing D'. For the allele identity plots, each 
column is one chromosome (as represented by 418 SNPs and 1.11 Mb) and each row is a SNP. Yellow represents alleles that match the 
consensus sequence whereas blue represents alleles that do not match the consensus. Congruent chromosomes are indicated by the tick mark 
at the bottom of the allele identity plot. This peak on chromosome 10 (near 103 Mb) demonstrates high YRI congruence but lower CEU 
congruence. 

Table 2 Feature comparison of ExHap, GERMLINE and Sweep. 

Seven features of starting data sets, algorithm logic, and resulting data are listed below. Each program is characterized based on the presence (yes) or absence 
(no) of the feature. 

Feature                                                       ExHap   GERMLINE   Sweep/EHH 
Requires phased chromosomes as input                          yes     yes        yes 
Handles missing data                                          yes     no         no 
Allows small number of mismatches (fuzzy matching)            yes     yes (1)    no 
Designed to analyze a large number of SNPs (100s to 1000s)    yes     yes        no 
Supports genetic distance                                     no      yes        yes 
Outputs aggregate/region-wide metrics                         yes     no (2)     yes (3) 
Outputs chromosome level detail                               yes     yes (4)    no 

(1) Initial "seed" "haplotypes" are exact match. After that, a user-configurable number of mismatches is allowed. 
(2) Segment-level details (shared between 2 chromosomes) can be aggregated to describe region. 
(3) Metrics are reported at a "core" level, where a core is a particular haplotype within a haplotype block. These metrics can be aggregated across all cores within a 
region to describe the region. 
(4) Primary output is overlapping segments shared between 2 chromosomes, but report is at an individual level rather than chromosome level. 



To compare Sweep to ExHap, we used Sweep to iden- 
tify the core haplotype blocks (as defined by Gabriel 
[16]) between 3 and 20 SNPs in length, and to calculate 
EHH for all haplotypes within each core block at a dis- 
tance of 600 Kb. Finally, we computed the average EHH 
of the 30 haplotype cores with the highest EHH. This 
provides some degree of normalization for the different 
number of chromosomes in the two cohorts. Hereafter, 
we will refer to this metric as AvgEHH-30. The para- 
meters of 600 Kb in distance and top 30 haplotype 
cores were selected after experimentation with a variety 
of parameter settings. By definition, EHH and thus 
AvgEHH-30 can range from 0 to 1. The smallest differ- 
ence in AvgEHH-30 between CEU and YRI is 0.06 for 
the region on chromosome 8. This is also the region 
with the smallest difference in Percent Congruent as 
derived by ExHap. The largest difference in AvgEHH-30 
is 0.19 for chromosome 10. YRI has both the larger Per- 
cent Congruent value and the larger AvgEHH-30. On 
chromosome 2, CEU has both the larger Percent 
Congruent value and the larger AvgEHH-30, though the 
difference in AvgEHH-30 is not as dramatic as the dif- 
ference in Percent Congruent. 
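
The AvgEHH-30 summary described above can be sketched in a few lines of Python. The sketch assumes the usual definition of EHH (the probability that two randomly chosen chromosomes carrying a given core haplotype are identical across the extended region) and does not call Sweep. The function names and the string representation of extended haplotypes are illustrative assumptions.

```python
from collections import Counter
from math import comb


def ehh(extended_haplotypes):
    """EHH for one core haplotype.

    extended_haplotypes: one string per chromosome carrying the core,
    covering the SNPs from the core out to the chosen distance
    (hypothetical representation).
    """
    n = len(extended_haplotypes)
    if n < 2:
        return 0.0  # no pairs to compare
    counts = Counter(extended_haplotypes)
    # Pairs of identical chromosomes divided by all possible pairs
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


def avg_top_ehh(ehh_values, top_n=30):
    """Average of the top_n largest EHH values (AvgEHH-30 when top_n=30)."""
    top = sorted(ehh_values, reverse=True)[:top_n]
    return sum(top) / len(top) if top else 0.0
```

Because only the highest-scoring cores enter the average, the summary is less sensitive to the total number of cores found in each cohort, which is the normalization motive given in the text.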

In summary, all three programs were designed to 
identify different phenomena in haplotype matrices. 
Thus, the numeric results are not directly comparable. 
However, the discussion of features and results illus- 
trates the many ways in which chromosomal regions 
can be studied. 

Table 3 Numeric comparison of ExHap, GERMLINE, and Sweep. 

Column definitions: Percent Congruent is reported by ExHap; Avg # of Shared Segments per chromosome (0.75 Mb min length) is derived from GERMLINE; Avg EHH of 30 Core Haplotypes with Highest EHH at 600 Kb From Core is derived from Sweep. 

Chr     Range in base pairs     Cohort   Percent Congruent   Avg # of Shared Segments   Avg EHH (top 30 cores) 
                                         (ExHap)             (GERMLINE)                 (Sweep) 
Chr2    135460477-136506753     CEU      79.0                19.1                       0.69 
                                YRI       7.0                 2.8                       0.59 
Chr8    99920891-100880665      CEU      72.7                 9.3                       0.90 
                                YRI      70.0                 7.9                       0.84 
Chr10   103101047-104210566     CEU       9.1                 4.5                       0.51 
                                YRI      44.5                 3.0                       0.70 

Three regions were selected based on patterns of congruence in HapMap CEU and YRI cohorts. Allele identity plots of the regions are presented in Figures 5, 6, 
and 7. GERMLINE results are expressed as average number of shared segments per chromosome to account for the difference in the number of chromosomes 
between CEU (n = 176) and YRI (n = 200). 

Discussion 

We studied congruence across the MHC region using a 
large dataset from the Type 1 Diabetes Genetics Consortium 
(T1DGC) with HLA allele and SNP typing 
(2,837 SNPs) for both affected sibling pairs and their 
parents. We defined chromosomal phase with familial 
analysis. We note that there are 33 haplotype groups in 
the T1DGC dataset that are > 50% congruent (Figure 
3A). In addition, we investigated the genome-wide congruence 
of family-phased founder chromosomes for two 
cohorts from HapMap. We identified 51 candidate 
extended haplotypes in the CEU data and 55 in the YRI 
data. Thus, it is plausible that biologically interesting 
extended haplotypes are located across the genome. 

The MHC region contains many genes with known 
immunological function. When stratified by HLA alleles 
(DQ-DR-B-A), there are 132 haplotype groups across 
the MHC region. Within these groups, congruence 
ranges from 3% to 100%, with a mean of 32.8%. As 
demonstrated by these results, and as illustrated in Fig- 
ure 3, the haplotype group alone does not guarantee a 
high level of congruence in dense SNP data. 

We applied rolling congruence to two haplotype 
groups from the T1DGC (Figure 4). The DRB1*0401-B*08 
chromosomes are significantly less congruent than 
the DRB1*03-B*08 chromosomes across the region from 
HLA-DRB1 to HLA-B. This offers an additional line of 
evidence that simply fixing HLA-DRB1 and HLA-B 
alleles is not sufficient to drive congruence within any 
particular haplotype group. 

In the two HapMap populations analyzed, CEU and 
YRI, CEU exhibits higher congruence than YRI for 
72% of our observations. YRI is an older population 
compared to CEU, and thus we expect YRI to demon- 
strate greater genetic diversity and less linkage disequi- 
librium than CEU. Therefore, regions where YRI has 
higher congruence than CEU are of particular interest, 
as they may reflect important underlying biological 
effects, especially with respect to survival in the tro- 
pics. The observation that the candidate extended hap- 
lotypes in the YRI cohort contain more genes 
associated with known KEGG pathways may also be 
related to the importance of these regions in long-term 
population survival. In addition, the MHC region on 
chromosome 6p21 does not appear to be uniquely con- 
gruent, as other genomic regions demonstrate similar 
levels of congruence. 

High levels of congruence in regions across the gen- 
ome could be explained by several mechanisms. First, 
the regions identified could simply represent regions 
that exhibit decreased recombination and therefore 
decreased haplotype diversity. While some regions of 
high congruence are also associated with low recombi- 
nation, such as in the centromere, there are regions of 
high congruence and high recombination (Additional 
file 8: Figure S2). Second, the regions identified could be 
a result of strong negative selection. Third, the identified 
regions could be under recent positive selection. Regions 
of the genome that have been under recent positive 
selection are typically characterized by long-range haplo- 
types surrounding the mutation that provides the selec- 
tive advantage and attains high frequency rapidly. A 
known recent selective event at the LCT locus is associated 
with the increase in animal domestication and 
adult milk consumption [17-19,34]. Our CEU data also 
show a congruence peak at this locus. 

Both haplotype block analysis and congruence analysis 
capture the patterns in haplotype matrices. Haplotype 
blocks work well for relatively small regions. When 
applied to candidate extended haplotypes of around 400 
SNPs, as shown in Figures 5, 6 and 7, haplotype blocks 
are less informative. We considered various metrics to 
summarize haplotype block related data across the 
regions, such as median D', median multiallelic D', num- 
ber of blocks, median block length, total possible var- 
iants, and percentage of informative SNPs, but were not 
satisfied with any single metric. In contrast, ExHap 
offers a single metric, the percentage of chromosomes 
that are congruent, to capture the commonalities across 
the region. Additionally, ExHap supports binary classifi- 
cation of individual chromosomes. Given a set of chro- 
mosomes, a consensus sequence is identified and 
individual chromosomes are classified as congruent or 
not. In comparison, the largest of the 8 haplotype blocks 
characterizing CEU chromosome 8 (Figure 6) contained 
13 variants of 154 SNPs, while the second largest con- 
tained 13 variants of 55 SNPs. 

Additionally, we compared selected features of ExHap 
to those of GERMLINE and Sweep. We also compared 
numeric results of the three programs for three chromo- 
somal regions in family phased HapMap data. All three 
programs were designed to identify different features of 
haplotype matrices associated with different underlying 
biological phenomena. However, the feature comparison 
in Table 2 and the numeric results in Table 3 provide 
additional insight into some of the tools available 
for the study of patterns of similarity in haplotype 
matrices. 

There are limitations to our approach. One limitation 
is that the derived consensus sequence (and which chro- 
mosomes are considered "congruent") is dependent on 
the start site. An appropriate start site is particularly 
important for the long-range analysis. We determined 
start sites a priori (i.e., starting at HLA-DRB1 for the 
MHC region and continuing telomeric) or by inspecting 
larger allele identity plots (i.e., chromosome 2, 8 and 10 
peaks). For example, in Figure 6, if looking for long- 
range congruence, a start site of 95 Mb would identify 
relatively low congruence, but a start site of 99 Mb (and 
going to about 101 Mb) would identify relatively high 
congruence. This is less of a problem for the rolling 
analysis as long as the length of the extended haplotype 
to be identified exceeds the length of the region of 
inspection. For example, using our parameters, we iden- 
tify multiple extended haplotypes covering more than 
250 SNPs, but probably miss extended haplotypes that 
are shorter than 250 SNPs. Our suggested strategy is to 
first use rolling analysis, with overlap and a moderate 
window (e.g., a 250 SNP window with a 50 SNP overlap), 
followed by long-range analysis once the appropriate 
range is identified. Another limitation of ExHap is that 
in a haplotype matrix with little congruence, only one 
chromosome in the matrix might be considered congru- 
ent (e.g., DR3-B7-A2 in Figure 3B). 
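
The rolling strategy suggested above depends on how the overlapping ranges are laid out. A minimal Python sketch follows, assuming that a "50 SNP overlap" means consecutive windows share 50 SNPs, so each window starts 200 SNPs after the previous one. The function name and the handling of trailing SNPs are assumptions, not ExHap behavior.

```python
def rolling_windows(n_snps, window=250, overlap=50):
    """Return (start, end) SNP index pairs, end exclusive.

    Consecutive windows share `overlap` SNPs. Trailing SNPs that do not
    fill a complete window are left unscored in this sketch; a region
    shorter than one window yields a single truncated window.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    last_start = max(n_snps - window, 0)
    return [(s, min(s + window, n_snps)) for s in range(0, last_start + 1, step)]
```

For a region of 650 SNPs this yields the ranges 0-250, 200-450, and 400-650. Congruence would then be computed separately within each range, and runs of high-scoring ranges would point to the span worth revisiting with the long-range analysis.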

Successful identification of extended haplotypes based 
on high density SNP data demands the accommodation 
of a low level of mismatches, whether due to experimen- 
tal error (e.g., inaccurate genotype clustering) or point 
mutations. The ability of ExHap to accommodate 
experimental error might be particularly valuable in the 
analysis of data generated by genome-wide association 
studies, as the raw number of genotyping errors is more 
likely to increase when millions of SNPs are typed. 
Under our standard parameters, we require allele iden- 
tity for 20 of 30 SNPs across each overlapping scoring 
window for a chromosome to be considered congruent. 
At first glance, this appears to be a rather weak defini- 
tion of congruence. However, a chromosome must 
repeatedly and consistently match the emerging consen- 
sus string to be considered congruent. The actual allele 
identity within congruent chromosomes is typically very 
high (median 99.5% identity in our 132 T1DGC haplo- 
type groups). We believe this is due to both the underly- 
ing haplotypes (the haplotypes either "match" or "don't") 
and to the use of a "look ahead" window that empha- 
sizes longer haplotypes. That said, the matching rule is 
user-configurable. A more stringent matching rule (e.g., 
28 of 30) might be appropriate if point mutations on a 
specific haplotype are implicated in disease. In general, 
parameter settings should be tuned to accommodate the 
nature of the data set (e.g. number of SNPs, SNP den- 
sity) and the features under consideration. 
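
The matching rule can be illustrated with a deliberately simplified Python sketch. It replaces ExHap's start site, emerging consensus, and look-ahead window with a plain per-SNP majority vote, so it shows only the "20 of 30" scoring idea. The names, the `step` between scoring windows, and the choice to tolerate missing alleles rather than count them as mismatches are all assumptions.

```python
from collections import Counter

MISSING = None  # stand-in for a missing or unphased allele (assumption)


def majority_consensus(chromosomes):
    """Most common non-missing allele at each SNP (not ExHap's consensus)."""
    consensus = []
    for alleles in zip(*chromosomes):
        observed = [a for a in alleles if a is not MISSING]
        consensus.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return consensus


def is_congruent(chromosome, consensus, window=30, min_matches=20, step=15):
    """True if no scoring window exceeds (window - min_matches) mismatches."""
    max_mismatches = window - min_matches
    last_start = max(len(consensus) - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the final SNPs are scored
    for start in starts:
        pairs = zip(chromosome[start:start + window], consensus[start:start + window])
        mismatches = sum(
            1 for a, c in pairs
            if a is not MISSING and c is not MISSING and a != c
        )
        if mismatches > max_mismatches:
            return False
    return True


def percent_congruent(chromosomes, **rule):
    """Percent of chromosomes classified congruent, plus the per-chromosome flags."""
    consensus = majority_consensus(chromosomes)
    flags = [is_congruent(ch, consensus, **rule) for ch in chromosomes]
    return 100.0 * sum(flags) / len(flags), flags
```

Passing `min_matches=28` gives the more stringent rule mentioned above. Even in this reduced form, a chromosome must pass every window along the region, which is why congruent chromosomes tend to show allele identity well above the nominal two-thirds threshold.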

The look-ahead window that emphasizes length is one 
of the important features of ExHap. The emphasis on 
length, coupled with allele identity plots, allows us to 
explore a set of chromosomes beyond the limits of a 
candidate or known extended haplotype (e.g., HLA-DRB1 
to HLA-B), detecting patterns of congruence deteriora- 
tion such as those illustrated in Figure 4. Another 
important feature is the ability to specify a starting 
point for congruence computation, such as HLA-DRB1. 
This feature was important in our study of the MHC as 
it allowed us to incorporate prior knowledge of haplo- 
types defined by HLA alleles alone. 

ExHap was also designed to support the study of case 
and control chromosomes. The long-range linkage dise- 
quilibrium present on extended haplotypes confounds 
association analyses as specific risk alleles are difficult to 
localize given long-range linkage disequilibrium. How- 
ever, long-range haplotypes are useful in identifying 
common complex disease associations related to a specific 
haplotype, where affected individuals share a long 
haplotype. By design, ExHap stratifies chromosomes 
into two classes, congruent and not congruent. Additionally, 
the program further tallies the results by case 
and control. The resulting tallies can be compared, for 
example, to investigate the difference in congruence 
between the two groups. 
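
As a rough illustration of such a tally, the Python sketch below cross-tabulates congruence calls against case or control status. It assumes one boolean congruence call and one status label per chromosome; the function name and the output layout are hypothetical and do not describe ExHap's actual report format.

```python
from collections import Counter


def tally_by_status(congruent_flags, statuses):
    """Cross-tabulate congruence calls by status label (e.g. case/control)."""
    if len(congruent_flags) != len(statuses):
        raise ValueError("need one status label per chromosome")
    counts = Counter(zip(statuses, congruent_flags))
    table = {}
    for status in sorted(set(statuses)):
        n_yes = counts[(status, True)]
        n_no = counts[(status, False)]
        table[status] = {
            "congruent": n_yes,
            "not_congruent": n_no,
            "percent_congruent": 100.0 * n_yes / (n_yes + n_no),
        }
    return table
```

The resulting two-by-two counts could then be passed to whatever association test suits the study design.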

Furthermore, ExHap is a valuable tool for visualizing 
and quantifying patterns of similarity and differences in 
haplotype matrices. As such, it is useful for hypothesis 
generation. For example, by applying the tool to Hap- 
Map data, we identified regions that are highly congru- 
ent for both CEU and YRI, congruent for just CEU, or 
congruent for just YRI. Each of these categories is 
potentially interesting for different reasons. One might 
hypothesize that a region highly congruent for YRI but 
not CEU might be important for population health and 
survival in tropical Africa. A similar approach compar- 
ing congruence across disease cohorts might be of great 
value in studying Immunochip Consortium data sets 
(nearly 200,000 SNPs, over 150,000 chips ordered in 
support of planned studies of diseases including rheu- 
matoid arthritis, T1D, and multiple sclerosis) [35]. 

Conclusions 

In conclusion, we illustrate that congruence is informa- 
tive for ranges of 250-2000 SNPs, both in the MHC and 
across the genome. A simple metric, the percentage of 
chromosomes that are congruent, easily highlights 
regions of interest for future research. In addition, bin- 
ary classification of individual chromosomes as congru- 
ent or not congruent facilitates subsequent analysis. 
Furthermore, we can study individual chromosomes in 
the context of extended haplotypes using allele identity 
plots. This could prove important in disease applications 
where an individual's specific mutation is of interest. 
Thus, the ability to identify extended haplotypes and to 
inspect individual chromosomes using congruence is 
potentially valuable for both population and disease 
genetics. 

Additional material 



Additional file 1: ExHap User's Guide. 

Additional file 2: ExHap Software Distribution. 

Additional file 3: Table S2. Congruence peaks may represent regions of 
extended haplotypes. This table includes chromosome, cohort, telomeric 
and centromeric positions identifying the limits of the peak region, the 
length of the region in megabases (Mb), and the length of the region in 
SNPs. For each cohort, the peak was defined as two or more contiguous 
ranges of 250 SNPs where congruence exceeded the 99th percentile of 
congruence for that cohort (51.1% for CEU, 27.5% for YRI). 

Additional file 4: Table S1. Congruence of haplotype groups identified 
in the T1DGC data. Raw data for the bar chart in Figure 3A is shown, 
including percent of chromosomes that are congruent and the percent 
allele identity within those congruent chromosomes. 

Additional file 5: Figure S1. Congruence across the genome. These 
figures illustrate rolling short range congruence across chromosomes 1 
through 22 for both CEU and YRI cohorts of the International HapMap 
Project. The horizontal reference lines (solid = CEU; dashed = YRI) show 
the 99th percentile of congruence (51.1% for CEU, 27.5% for YRI) for the 
respective cohorts. Overall, congruence was calculated for 27,652 
overlapping ranges of 250 SNPs. Centromeres are indicated by "cen" at 
the starting position. Other loci of interest are indicated by the name [e.g., 
LCT or UNK (unknown)] and a source in which they are discussed (e.g., 
Sabeti2007). 

Additional file 6: Table S3. Genes contained in candidate extended 
haplotypes. The genes listed in the table are contained in candidate 
extended haplotypes of either the CEU cohort and/or the YRI cohort, 
according to data compiled in the UCSC Genome Browser database. 
Some genes are listed multiple times due to differences in start position 
and/or description. 

Additional file 7: Table S4. KEGG Pathways associated with genes in 
candidate extended haplotypes. The KEGG pathways listed in the table 
are associated with genes found in candidate extended haplotypes of 
either the CEU cohort and/or the YRI cohort, according to data compiled 
in the UCSC Genome Browser database. "Hits" is the number of genes 
associated with each pathway. 

Additional file 8: Figure S2. For chromosome 10, the maximum 
recombination rate was identified for each 250 SNP range for which 
congruence was calculated. These recombination rates are plotted 
against the corresponding congruence rates for the YRI cohort. 
Reference lines indicate the 90th percentile of the congruence rates 
(11%) and the 10th percentile of the recombination rates (17%). The 
points in the upper right hand quadrant represent those regions for 
which both congruence and recombination are relatively high and 
potentially warrant further investigation. 